ABSTRACT

Statistical methodologists have sometimes criticized the use of
conventional statistics in meta-analysis, and in recent years a
number of them have advocated the use of a special new statistical
methodology for research synthesis. An examination of recent books
describing this methodology shows that it is seriously limited in
its applicability to social science research findings. The new
methodology produces interpretable meta-analytic results only in
exceptional circumstances (e.g., when each study in a collection
uses the same unblocked, posttest-only experimental design). The new
statistical methodology for meta-analysis has produced
uninterpretable results when applied to typical collections of social
science studies with varied experimental designs. (Author)
Operative and Interpretable Effect Sizes in Meta-analysis



In a classic 1976 paper Glass defined meta-analysis as the
application of statistical methods to results from a large
collection of studies for the purpose of integrating findings.
The statistical methods that Glass used in meta-analysis were
conventional ones such as analysis of variance and multiple
regression analysis. In meta-analysis, however, these statistics
were applied not to raw observations, but rather to effect sizes,
or standardized scores that represented the size of treatment
effects in different studies on a common scale of standard
deviation units. Hedges (1982) has recently commented on Glass's
use of conventional statistics in research synthesis:

Such use seemed at first to be an innocuous extension of
statistical methods to a new situation. However, recent
research has demonstrated that the use of such statistical
procedures as analysis of variance and regression analysis
cannot be justified for meta-analysis. Fortunately, some new
statistical procedures have been designed specifically for
meta-analysis (p. 25).

This new methodology for meta-analysis builds on statistical
techniques originally developed by Cochran (1954) for testing the
homogeneity of results in related experiments and for making
composite estimates of population parameters from such results.
Hedges (1981, 1982) first showed how Cochran's techniques could be
applied to experimental results coded as effect sizes. Hunter,
Schmidt, and Jackson (1982) and Rosenthal (1984) later advocated
the use of formulas and tests similar to those used by Hedges.
Hedges (1984) has referred to the methods used by himself,
Rosenthal, and Hunter and Schmidt as "modern statistical methods
for meta-analysis." Bangert-Drowns (in press) has described the
methods as "approximate data pooling."

Recent books describing the new methodology (Hedges & Olkin, 1985;
Hunter, Schmidt, & Jackson, 1982; and Rosenthal, 1984) almost
entirely ignore one of the central emphases in Glass's work: the
estimation of effect sizes from studies with different
experimental designs. Glass and his colleagues have argued that
different procedures are needed for estimating effect sizes in
simple experiments with unblocked posttest-only designs and in
more complex experiments (Glass, McGaw, & Smith, 1981; McGaw &
Glass, 1980). Glass and his colleagues use the following formula,
for example, to estimate effect sizes from simple experiments that
compare posttest means of independent groups:
$$ d = \frac{\bar{Y}^E - \bar{Y}^C}{S_y} , \qquad (1) $$

where $d$ is the estimator of effect size for a specific study,
$\bar{Y}^E$ and $\bar{Y}^C$ are sample means of the experimental
group (E) and control group (C) on the dependent variable Y, and
$S_y$ is the sample standard deviation on Y. They use different
formulas for calculating effect sizes from studies with
experimental designs that control for irrelevant variation in the
dependent variable: comparisons of matched groups, comparisons of
gain scores, covariance analyses, multifactor analyses of
variance, etc. (Glass et al., 1981, pp. 114-123).

The recent books on meta-analytic methodology all give
Formula 1 as a basic formula for estimating effect sizes, but none
of the books quotes even one of the additional formulas that Glass
and his colleagues have used for calculating effect sizes from
studies with more complex experimental designs. Hedges and Olkin
(1985, p. 13), for example, have simply noted that such formulas
are outside their domain of interest; they do not consider them a
central issue in the statistics of meta-analysis. Rosenthal
(1984, pp. 30-31) has referred to these formulas as "formulas for
adjusting effect sizes" and has cautioned that those who calculate
them should also report "unadjusted effect sizes" alongside the
"adjusted" ones.

Neither Rosenthal nor other developers of the newer
statistical methods for meta-analysis, however, have written much
about the calculation of "unadjusted" effect sizes with powerful
experimental designs. Recent books on meta-analysis focus almost
exclusively on calculation of effect sizes with an unblocked
posttest-only design. The only design other than this one covered
in detail in recent books is the comparison of pre-post gains of
experimental and control groups. What recent methodologists have
said about this design shows how different their approach is from
Glass's.



Glass and his colleagues provided the following formula for
calculating an effect size for this design:

$$ d_g = \frac{\bar{G}^E - \bar{G}^C}{S_y} , \qquad (2) $$

where $d_g$ is the effect size estimated from a comparison of gain
scores and $\bar{G}^E$ and $\bar{G}^C$ are the average gains for the
experimental and control groups (Glass et al., 1981, pp. 115-118).
The numerators of Formulas 1 and 2 are calculated differently. The
numerator of Formula 2 is calculated from group gains rather than
from group postscores. But Formulas 1 and 2 do not differ in
denominators. For both designs, the posttest standard deviation
is used in standardizing the mean differences.

Rosenthal, on the other hand, has provided the following
formula for use with gains analyses (1984, p. 21):

$$ d_R = \frac{\bar{G}^E - \bar{G}^C}{S_g} , \qquad (3) $$

where $d_R$ is the effect size estimated by Rosenthal from a
comparison of gain scores; $\bar{G}^E$ and $\bar{G}^C$ are the
average gains of the experimental and control groups, as above; and
$S_g$ is the standard deviation of the gain scores and is equal to
$S_y \sqrt{2(1 - r_{xy})}$, where $r_{xy}$ is the correlation between
pretest (X) and posttest (Y) scores. Note that Rosenthal's and
Glass's formulas differ in standardization term. Rosenthal uses the
standard deviation of gain scores in the denominator of his formula,
whereas Glass uses the standard deviation of the posttest in his
formula. Effect sizes calculated by Glass ($d_g$, Formula 2) and by
Rosenthal ($d_R$, Formula 3) from the same gains analysis would
therefore be related by the following formula:

$$ d_g = d_R \sqrt{2(1 - r_{xy})} . \qquad (4) $$

Because pretest-posttest r's can be quite high in educational and
psychological research, effect sizes calculated by Glass and
Rosenthal can differ by a large amount. For example, with a
pretest-posttest correlation of .8, an effect size calculated from
Formula 2 would be only about 63% as large as an effect size
calculated from Formula 3 for the same experiment.
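
The relation between the two gain-based effect sizes can be checked numerically. The following Python sketch is illustrative only: the mean gains and the posttest standard deviation are hypothetical values, the function names are ours, and the code assumes, as the formulas above do, that the standard deviation of gain scores equals the posttest standard deviation times the square root of 2(1 - r).

```python
import math


def glass_gain_effect(gain_e, gain_c, sd_post):
    """Formula 2: difference in mean gains divided by the posttest SD."""
    return (gain_e - gain_c) / sd_post


def rosenthal_gain_effect(gain_e, gain_c, sd_post, r_xy):
    """Formula 3: difference in mean gains divided by the SD of gain scores."""
    sd_gain = sd_post * math.sqrt(2.0 * (1.0 - r_xy))
    return (gain_e - gain_c) / sd_gain


# Hypothetical study: mean gains of 8 and 3 points, posttest SD of 10,
# and a pretest-posttest correlation of .8.
d_glass = glass_gain_effect(8.0, 3.0, 10.0)
d_rosenthal = rosenthal_gain_effect(8.0, 3.0, 10.0, 0.8)
ratio = d_glass / d_rosenthal

print(f"Formula 2: {d_glass:.3f}")
print(f"Formula 3: {d_rosenthal:.3f}")
print(f"ratio:     {ratio:.3f}")
```

With a correlation of .8 the ratio equals the square root of .4, or roughly .63, whatever raw values are supplied.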

Kraemer and Andrews (1982), who have also contributed to the
development of new statistical methods for meta-analysis, have
criticized Formula 3 on the grounds that the standardization term
is wrong. They have pointed out that Formulas 1 and 3 do not
estimate the same quantity. What is remarkable about Kraemer
and Andrews's discussion is their assumption that meta-analysts
use Formula 3 to calculate effect sizes from comparisons of gains.
Long before Kraemer and Andrews published their article, Glass and
his colleagues had advised meta-analysts not to use Formula 3 in
calculating effect sizes for comparisons of gains but to use
Formula 2 instead. Glass's advice has been either overlooked or
ignored by recent contributors to meta-analytic methodology.



The purpose of this article is to review issues involved in
estimating effect sizes from studies that use different
experimental designs. It covers estimation of effect sizes for
both simple posttest-only designs and complex designs that remove
sources of irrelevant variation from the posttest. This article
also presents formulas for calculating the standard error of
effect sizes estimated under different designs. Finally, the
article shows that average effect sizes and their standard errors
are often miscalculated in meta-analyses that use the newer
statistical methods for research synthesis.

Operative and Interpretable Effect Sizes

The notion of measuring effect sizes was a familiar one to
many social scientists long before Glass used indices of effect
sizes as a key tool in meta-analysis. Cohen's (1977) classic book
on power analysis in the behavioral sciences made extensive use of
effect sizes in estimating the power of statistical tests and in
determining sample sizes needed to achieve tests of a given power.
Cohen's book on power analysis also introduced a critical
distinction between two types of effect sizes: interpretable
effect sizes and operative effect sizes.



When calculated from means, interpretable effect sizes are
determined by dividing a treatment effect expressed in raw units
(Y-units) by the standard deviation of Y. Cohen used the symbol $d$
to stand for the interpretable effect size for the posttest-only,
independent-group design. He added primes and subscripts to the
symbol $d$ to denote interpretable effect sizes calculated for other
experimental designs. For example, Cohen used the symbol $d'_4$ for
the interpretable effect size calculated for one sample of n
differences between paired observations (1977, p. 49):

$$ d'_4 = \frac{\bar{Y} - \bar{X}}{S_y} , \qquad (5) $$

where $\bar{X}$ is the mean of the experimental group on a pretest
and $\bar{Y}$ is the mean on the posttest. Cohen used the symbol
$d'_3$ for the interpretable effect size calculated for a study in
which the mean of the experimental group is compared to a
theoretical population mean (p. 46). Cohen pointed out, however,
that all such interpretable effect sizes are conceptually
equivalent and can be interpreted on a common scale. This is
because the standardizing unit for interpretable effect sizes is
always the standard deviation of Y ($S_y$).

Operative effect sizes are an entirely different matter.
Operative effect sizes are calculated by dividing a treatment
effect expressed in Y-units by either the standard deviation of Y
or a standard deviation from which sources of variation have been
removed by one or another adjustment mechanism designed to
increase power: covariance, regression, or blocking. Operative
effect sizes are identical to interpretable effect sizes only in
experiments that do not remove sources of irrelevant variation
from the dependent variable, e.g., Campbell and Stanley's (1963)
Type 6 experiments. For other experimental designs, operative
effect sizes are calculated with different standardizing units.
The operative effect size $d$ for paired observations for one
sample would be estimated from (Cohen, 1977, p. 63):



d = _ . (6) 



Cohen used the symbol $d$, without subscripts or primes, to
represent operative effect sizes calculated for a variety of
experimental designs. Although denoted by a common symbol,
operative effect sizes calculated for different experimental
designs are not conceptually equivalent because different
standardizing units are used in calculating them. Operative
effect sizes cannot therefore be interpreted in a single way.
Operative effect sizes are useful, however, because they can be
employed directly to find values of power in power tables.

A critical point to grasp is this: Meta-analysis must be
based on interpretable effect sizes, not operative effect sizes,
if it is to produce interpretable results. Operative effect sizes
have an undesirable property that rules out their use
in meta-analysis. With a given raw-score unit, they vary not only
as a function of size of the raw-score treatment effect but also
as a function of the experimental design used to investigate the
effect. Two investigators studying the same phenomenon who find
identical treatment effects (when effects are expressed in
raw-score units) would report different operative effect sizes if
they used different research designs to study this treatment
effect.

Suppose, for example, that investigators A and B are studying
the effect of a diet-supplement program, and each investigator
finds that the program has the same effect on the dependent
variable: It adds an average of 10 pounds to the weight of
experimental subjects. Suppose further that investigator A has
used a weak experimental design to estimate the effect and test
its significance (e.g., a posttest-only design for independent
groups), whereas investigator B has used a more powerful design
(e.g., analysis of covariance with a covariate that correlates .9
with the dependent variable). Even though the raw treatment
effects estimated by the two investigators are of the same
magnitude, the operative effect size calculated by investigator B
will be roughly twice the size of the operative effect size
calculated by investigator A because the standardization term
employed by investigator A is the raw standard deviation, whereas
the standardization term employed by investigator B is a standard
deviation of residuals from which important sources of variation
have been removed. The interpretable effect sizes calculated by
the two investigators, however, will be the same, as they should
be for two raw treatment effects of the same magnitude.
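
The contrast between the two investigators can be worked through in a few lines of Python. This is a sketch under stated assumptions rather than a reanalysis: the raw standard deviation of 20 pounds is hypothetical, and the residual standard deviation for the covariance design is assumed to equal the raw standard deviation times the square root of 1 - r squared.

```python
import math

RAW_EFFECT = 10.0   # pounds gained, from the example in the text
SD_WEIGHT = 20.0    # hypothetical raw standard deviation of weight, in pounds
R_COVARIATE = 0.9   # covariate-outcome correlation, from the text

# Investigator A: posttest-only design, standardized on the raw SD.
operative_a = RAW_EFFECT / SD_WEIGHT

# Investigator B: covariance analysis, standardized on the residual SD
# (assumed here to be SD * sqrt(1 - r^2)).
residual_sd = SD_WEIGHT * math.sqrt(1.0 - R_COVARIATE ** 2)
operative_b = RAW_EFFECT / residual_sd

# The interpretable effect size uses the raw SD for both investigators.
interpretable_a = RAW_EFFECT / SD_WEIGHT
interpretable_b = RAW_EFFECT / SD_WEIGHT

print(f"operative:     A = {operative_a:.2f}, B = {operative_b:.2f}")
print(f"interpretable: A = {interpretable_a:.2f}, B = {interpretable_b:.2f}")
```

Under this particular assumption about the residual standardizer, B's operative effect size comes out at about 2.3 times A's, while the two interpretable effect sizes are identical.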

It is useful to keep this distinction between operative and
interpretable effect sizes in mind when examining Formulas 2 and
3, the formulas that Glass and Rosenthal use for calculating
effect sizes from comparisons of gains. It should be clear that
the formula used by Glass is a formula for an interpretable effect
size; Rosenthal's formula is for an operative effect size. It
should also be clear that Glass's formula is the appropriate one
to use in meta-analysis, and Rosenthal's formula is inappropriate
for use in research synthesis.



Although Glass and his colleagues have not mentioned the
distinction between operative and interpretable effect sizes in
their writings on meta-analysis, their work suggests that they
were aware of the importance of using interpretable effect sizes,
rather than operative ones, in research synthesis (e.g., Glass et
al., 1981). The various formulas for effect size that they have
given are all formulas for interpretable effect sizes. Rosenthal
(1984), on the other hand, appears to believe that operative
effect sizes are the appropriate ones to calculate in research
synthesis. Although Kraemer and Andrews (1982) recognize the
inappropriateness of operative effect sizes, they appear to
believe that such effect sizes are the ones that meta-analysts
actually calculate.

Standard Error of an Effect Size

One of the major contributions of Hedges to the methodology
for meta-analysis is the derivation of a formula for the standard
error of effect sizes. Hedges (1982) derived this formula for
standard error from a set of explicit assumptions about the data
being analyzed in research syntheses. Careful examination of
these assumptions will reveal, we believe, that they seldom are
met by meta-analytic data sets.

Data in the model that underlies Hedges' formulas for
analysis of effect sizes come from a series of k independent
studies, each of which compares an experimental group (E) with a
control group (C). Hedges lets $Y^E_{ij}$ and $Y^C_{ij}$ stand for
the jth scores in the ith experiment from the experimental and
control groups, respectively. His model assumes that in a given
study i, $Y^E_{ij}$ and $Y^C_{ij}$ are normally distributed with
means $\mu^E_i$ and $\mu^C_i$ and common variance $\sigma^2_i$. The
population effect size for study i will then be given by the
following (Hedges, 1982, p. 491):

$$ \delta_i = \frac{\mu^E_i - \mu^C_i}{\sigma_i} . \qquad (7) $$



Hedges has examined the properties of Glass's estimator $d$
(Equation 1) of this population parameter. He has shown that in
small samples Glass's $d$ is a biased estimator of the population
parameter. An unbiased estimator of $\delta$ can be approximated
from the following:

$$ d^U = \left[ 1 - \frac{3}{4(n^E + n^C - 2) - 1} \right] d , \qquad (8) $$



where $n^E$ and $n^C$ are the sample sizes of the experimental and
control groups. The sampling distribution of this unbiased
estimator is that of a noncentral t times a constant. If both $n^E$
and $n^C$ are large, the distribution of $d^U$ or $d$ is
approximately normal with $M(d) \approx \delta$ and variance

$$ \sigma^2(d) = \frac{n^E + n^C}{n^E n^C} + \frac{\delta^2}{2(n^E + n^C)} , \qquad (9) $$

where $\delta$, the population effect size, is the noncentrality
parameter (Hedges, 1982, p. 492). The standard error of $d$ is the
square root of this variance. This is an important formula
because it can be used to approximate the sampling error of effect
sizes estimated under certain conditions.
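
A short Python sketch may help fix Formulas 8 and 9. The study values below are hypothetical, the function names are ours, and Formula 9 is taken in its variance form, with the corrected estimate substituted for the unknown population effect size (a common practical shortcut that is an assumption of this sketch).

```python
import math


def small_sample_correction(n_e, n_c):
    """Approximate correction factor of Formula 8."""
    return 1.0 - 3.0 / (4.0 * (n_e + n_c - 2) - 1.0)


def variance_posttest_only(delta, n_e, n_c):
    """Formula 9: large-sample variance of d for a posttest-only design."""
    return (n_e + n_c) / (n_e * n_c) + delta ** 2 / (2.0 * (n_e + n_c))


# Hypothetical study: d = 0.50 with 20 subjects per group.
d, n_e, n_c = 0.50, 20, 20
d_unbiased = small_sample_correction(n_e, n_c) * d
standard_error = math.sqrt(variance_posttest_only(d_unbiased, n_e, n_c))

print(f"corrected d = {d_unbiased:.3f}, standard error = {standard_error:.3f}")
```

The correction shrinks d only slightly at these sample sizes; the first term of Formula 9, which depends only on the group sizes, accounts for nearly all of the sampling variance.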

It is important to emphasize, however, that this formula
applies only to situations that meet the assumptions made by
Hedges. These assumptions include (a) independence of
experimental and control groups, and (b) assessment of results on
a dependent variable from which no sources of irrelevant variation
have been removed in the experimental design. The model fits a
type of design that Campbell and Stanley (1963) call a Type 6
design, the posttest-only design for independent groups. It is a
design from which valid inferences can be drawn about experimental
treatments, but researchers who use it do not estimate treatment
effects with great precision.

Many social science researchers prefer to conduct experiments
that estimate treatment effects more precisely. Instead of
posttest-only designs for independent groups, they use
experimental designs and statistical techniques that remove
sources of irrelevant variation from dependent variables.
They use multiple-factor or matched-subject designs; they compare
gain scores of experimental and control groups; or they include
covariates in their statistical analyses. Such designs, rather
than the posttest-only design for independent groups, dominate the
literature of the social sciences. For example, none of the 14
studies included in a recent meta-analysis on coaching effects on
the Scholastic Aptitude Test (SAT) used a posttest-only design for
independent groups (Kulik et al., 1984). All studies located for
this meta-analysis compared gains or covariance-adjusted
postscores of experimental and control groups.



Additional Formulas for the Standard Error of Effect Sizes 



Hedges' Formula 9 would not have yielded an accurate standard
error for any of the effect sizes calculated by Kulik et al.
Additional formulas are needed to calculate standard errors from
studies comparing gain scores and studies using analysis of
covariance to measure treatment effects. Fortunately, it is easy
to derive such formulas. To do so, one takes advantage of an
algebraic relationship between (a) the formulas for calculating
average effect sizes for various experimental designs, and (b) the
t ratios used to test the statistical significance of the
treatment effects found with these designs. The relationship
between t ratios and effect size formulas has already been
demonstrated by Glass and his colleagues (Glass et al., 1981,
chapter 5).



Effect sizes calculated from gain scores of independent groups.
When an experimenter has estimated a treatment effect by comparing
gains of independent experimental and control groups, the
significance of the effect is usually tested by a t ratio. When
the assumptions for the use of this test are met, the
effect size in the experiment can be estimated by Glass's formula
for comparing gains (Formula 2). Glass and his colleagues (Glass
et al., 1981, p. 127) have shown that an effect size calculated in
this way is equal to the following:

$$ d_g = t_g \sqrt{2(1 - \rho_{xy})(1/n^E + 1/n^C)} , \qquad (10) $$

where $d_g$ is the effect size of Formula 2, $t_g$ is the t ratio
for testing a difference in gain scores, and $\rho_{xy}$ is the
population correlation, ordinarily estimated by $r_{xy}$, between
the pretest (X) and the posttest (Y).

It is easy to see from Equation 10 that the effect size $d_g$
calculated from gains is equal to a constant times the t ratio.
When assumptions for the use of this statistical test have been
met, the sampling distribution of an unbiased estimator of this
effect size is that of a noncentral t times the constant. With both
$n^E$ and $n^C$ large, the distribution of $d_g$ will be
approximately normal with $M(d_g) \approx \delta$ and variance

$$ \sigma^2(d_g) = 2(1 - \rho_{xy}) \frac{n^E + n^C}{n^E n^C} + \frac{\delta^2}{2(n^E + n^C)} . \qquad (11) $$



When pretest and posttest correlate more than .5, the standard
error of a study effect calculated from gains will be smaller than
the standard error of the study effect calculated from the
posttests only.
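
The conversion in Formula 10 and the sampling variance in Formula 11 can be sketched as follows. The t ratio, correlation, and group sizes are hypothetical, the function names are ours, and Formula 11 is taken in its variance form with the sample estimate standing in for the population effect size.

```python
import math


def effect_from_gain_t(t_gain, r_xy, n_e, n_c):
    """Formula 10: convert a t ratio for gain scores to an effect size."""
    return t_gain * math.sqrt(2.0 * (1.0 - r_xy) * (1.0 / n_e + 1.0 / n_c))


def variance_gain_design(delta, rho_xy, n_e, n_c):
    """Formula 11: large-sample variance of an effect size based on gains."""
    first = 2.0 * (1.0 - rho_xy) * (n_e + n_c) / (n_e * n_c)
    return first + delta ** 2 / (2.0 * (n_e + n_c))


def variance_posttest_only(delta, n_e, n_c):
    """Formula 9, included for comparison."""
    return (n_e + n_c) / (n_e * n_c) + delta ** 2 / (2.0 * (n_e + n_c))


# Hypothetical study: t = 3.0 on gains, r = .8, 25 subjects per group.
d_gain = effect_from_gain_t(3.0, 0.8, 25, 25)
se_gain = math.sqrt(variance_gain_design(d_gain, 0.8, 25, 25))
se_post = math.sqrt(variance_posttest_only(d_gain, 25, 25))

print(f"d = {d_gain:.3f}")
print(f"SE from gains = {se_gain:.3f}, SE from posttests only = {se_post:.3f}")
```

At a correlation of exactly .5 the factor 2(1 - r) equals 1, so the two variance functions return the same value; above .5 the gain-based standard error is the smaller of the two.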



Effect sizes calculated from covariance-adjusted scores of
independent groups. Treatment effects are often tested for
significance using analysis of covariance. When assumptions are
met for use of the t ratio or F ratio for testing the difference
between the covariance-adjusted means of the experimental and
control groups, the effect size can be estimated by the
following:

$$ d_c = \frac{(\bar{Y}^E - \bar{Y}^C) - b_w (\bar{X}^E - \bar{X}^C)}{S_y} , \qquad (12) $$

where $d_c$ is the effect size based on covariance-adjusted means
and $b_w$ is the pooled within-group estimate of the regression of
final status Y on the covariate X. Glass and his colleagues have
shown that this effect size is equal to a constant times the t
ratio calculated for the same design (Glass et al., 1981, p. 127):



$$ d_c = t_c \sqrt{(1 - \rho^2_{xy})(1/n^E + 1/n^C)} , \qquad (13) $$

where $d_c$ is the effect size based on covariance-adjusted means,
$t_c$ is the t ratio for testing the treatment effect for this
experimental design, and $\rho^2_{xy}$ is the squared population
correlation, ordinarily estimated by the sample value $r^2_{xy}$,
between the final-status score Y and the covariate X. The sampling
distribution of an unbiased estimator of this effect is that of a
noncentral t times the constant. With both $n^E$ and $n^C$ large,
the distribution of $d_c$ will be approximately normal with
$M(d_c) \approx \delta$ and variance

$$ \sigma^2(d_c) = (1 - \rho^2_{xy}) \frac{n^E + n^C}{n^E n^C} + \frac{\delta^2}{2(n^E + n^C)} . \qquad (14) $$



A comparison of Formulas 14 and 9 shows that standard errors of
effect sizes calculated from residual scores are generally smaller
than are standard errors of effect sizes calculated from posttests.
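
The comparison of Formulas 14 and 9 can be illustrated with a small Python sketch. The t ratio, covariate correlation, and group sizes are hypothetical, the function names are ours, and Formula 14 is taken in its variance form; setting the correlation to zero in that function gives the posttest-only variance of Formula 9.

```python
import math


def effect_from_ancova_t(t_cov, r_xy, n_e, n_c):
    """Formula 13: convert a covariance-analysis t ratio to an effect size."""
    return t_cov * math.sqrt((1.0 - r_xy ** 2) * (1.0 / n_e + 1.0 / n_c))


def variance_ancova_design(delta, rho_xy, n_e, n_c):
    """Formula 14: large-sample variance for covariance-adjusted effects."""
    first = (1.0 - rho_xy ** 2) * (n_e + n_c) / (n_e * n_c)
    return first + delta ** 2 / (2.0 * (n_e + n_c))


# Hypothetical study: t = 3.0, covariate correlation of .7, 25 per group.
d_cov = effect_from_ancova_t(3.0, 0.7, 25, 25)
se_cov = math.sqrt(variance_ancova_design(d_cov, 0.7, 25, 25))

# With a correlation of zero the function reduces to Formula 9.
se_post = math.sqrt(variance_ancova_design(d_cov, 0.0, 25, 25))

print(f"d = {d_cov:.3f}")
print(f"SE (covariance design) = {se_cov:.3f}, SE (posttest only) = {se_post:.3f}")
```

Because the first term is multiplied by 1 minus the squared correlation, any nonzero covariate correlation lowers the standard error relative to the posttest-only value.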



Effect sizes estimated from posttest scores of matched
groups. When a researcher has used a matched-pairs design to test
the effect of an experimental treatment, the effect size can be
calculated from Formula 1. With such a design, the statistical
significance of the effect is usually tested with a t ratio for
correlated means. Glass has shown that this t ratio is related to
the effect size calculated for this design by the following (Glass
et al., 1981, p. 127):

$$ d_m = t_m \sqrt{\frac{2(1 - \rho_{yy'})}{N}} , \qquad (15) $$



where $\rho_{yy'}$ is the population correlation of the paired
posttest scores. This equation shows that an effect size estimated
from this design, $d_m$, is equal to a constant times the t ratio
calculated from the same design. The sampling distribution of an
unbiased estimate of this effect size is that of a t ratio times
the constant. With N large, the distribution of $d_m$ will be
approximately normal with $M(d_m) \approx \delta$ and variance

$$ \sigma^2(d_m) = \frac{2(1 - \rho_{yy'})}{N} + \frac{\delta^2}{2N} . \qquad (16) $$
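
Formulas 15 and 16 for the matched-pairs design can be sketched in the same way. The values below are hypothetical, the function names are ours, N is taken to be the number of matched pairs (an assumption about the notation), and Formula 16 is taken in its variance form.

```python
import math


def effect_from_paired_t(t_paired, r_pairs, n_pairs):
    """Formula 15: convert a t ratio for correlated means to an effect size."""
    return t_paired * math.sqrt(2.0 * (1.0 - r_pairs) / n_pairs)


def variance_matched_design(delta, rho_pairs, n_pairs):
    """Formula 16: large-sample variance for a matched-pairs design."""
    return 2.0 * (1.0 - rho_pairs) / n_pairs + delta ** 2 / (2.0 * n_pairs)


# Hypothetical study: 30 matched pairs, t = 2.5, and a correlation of .6
# between the posttest scores of paired subjects.
d_matched = effect_from_paired_t(2.5, 0.6, 30)
se_matched = math.sqrt(variance_matched_design(d_matched, 0.6, 30))

print(f"d = {d_matched:.3f}, standard error = {se_matched:.3f}")
```

The higher the correlation between paired scores, the smaller the constant that multiplies the t ratio, so a given t corresponds to a smaller interpretable effect size in a tightly matched design.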

Implications for Meta-analyses in the Literature

The lack of explicitness in recent books on meta-analysis
about the need for a variety of formulas for effect sizes and
standard errors has led to flawed analyses and flawed conclusions
in the research literature. We focus here on three meta-analyses
that have used the newer statistical methods for research
synthesis. The reviews were selected for special examination on
the basis of the detail that they contain. The first and second
reviews (Becker, 1983; Rosenthal & Rubin, 1982) list effect sizes
and standard errors for individual studies. The third review
(Pearlman, 1984) contains a numerical calculation of cumulated
sampling error in a set of effect sizes listed in a report by
Kulik et al. (1984). The three reviews contain enough detail for
readers to reconstruct what was actually done in the analyses.



Interpersonal expectancies. To illustrate the utility of
their method of research synthesis, Rosenthal and Rubin (1978,
1982) applied their test of homogeneity to
findings on interpersonal expectancy reported originally in a
dissertation by Keshock (1971). Subjects in Keshock's
dissertation were 48 black inner-city boys in grades 2 through 5.
Half the children at each grade level were assigned to the
experimental treatment, and their teachers were told that these
children showed an ability level one standard deviation higher
than their actual scores. The teachers of the control children
were given the children's actual ability scores. Keshock reported
that this experimental treatment had no significant effect on the
children's achievement scores, as measured by the Wide Range
Achievement Test (WRAT).

Rosenthal and Rubin concluded that Keshock's analysis was
inadequate because it failed to take into account grade-level
differences in treatment effects. They reanalyzed Keshock's data
using the WRAT gain scores that Keshock listed in an appendix to
his dissertation. Rosenthal and Rubin reported their conclusions
succinctly:

Gains in performance were substantially greater for the
children whose teachers had been led to expect greater gains
in performance. The sizes of the effects varied across the
four grades from nearly half a standard deviation to nearly
four standard deviations. For all subjects combined, the
mean effect size was 2.04. (p. 383).

It is not at all obvious how Keshock could have missed an
effect of this magnitude. Cohen (1977) has given rough guidelines
for interpreting effect sizes. According to Cohen, effects of
about 0.8 standard deviations should be considered large; they can
usually be detected by eye without the aid of special measuring
tools. Although Rosenthal and Rubin's reanalysis showed an
average effect size of 2.04, Keshock had not noticed any effect of
teacher expectations on WRAT scores. Nor did his original
statistical analysis disclose such effects. How could Keshock
have failed to note so large an effect?

How also could Keshock's experimental treatment have produced
an effect of this magnitude? Teachers of children in the
experimental group were simply told that their children had IQs
one standard deviation higher than their actual IQs. According to
Rosenthal and Rubin, this simple manipulation raised achievement
scores of experimental-group children by an average of two
standard deviations. Scores of second graders rose by an average
of almost four standard deviations. These are enormous gains.
One standard deviation on an IQ scale is equal to 15 points, for
example; a gain of two standard deviations on such a scale is
equivalent to 30 points, and a gain of four standard deviations is
equivalent to 60 points.

Rosenthal and Rubin's results are not so paradoxical as they
may at first appear to be. Rosenthal and Rubin calculated effect
sizes from Keshock's t ratios for testing the significance of a
difference in gains, but they did not use the appropriate formula
(Equation 10). Instead, they converted the t ratios to effect
sizes using an equation appropriate for the t test for comparing
final-status scores:

$$ d = t \sqrt{1/n^E + 1/n^C} . \qquad (17) $$



Rosenthal and Rubin therefore expressed the treatment effects for
Keshock's study not in terms of variation in achievement but
rather in terms of variation in achievement gains. Such effect
sizes are not interpretable on the same scale as are other effect
sizes.

Conventional effect sizes can be estimated easily for
Keshock's experiment. Scores on the WRAT reading and arithmetic
tests are standardized scores with a mean of 100 and a standard
deviation of 15. Raw treatment effects reported by Keshock on the
WRAT were 5.54 points in reading and 5.91 points in arithmetic.
These gains are analogous to increases of this magnitude on an IQ
scale. Just as it would be misleading to refer to a gain of 5 IQ
points as representing 2.04 standard deviations, it is misleading
to refer to a gain of this size on the WRAT as equivalent to an
effect size of 2.04. The average effect size on the WRAT was 0.37
standard deviations in reading and 0.39 standard deviations in
arithmetic.
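
The arithmetic behind these conventional effect sizes, and the gap between the two t-to-d conversions, can be sketched in Python. The raw effects and the WRAT standard deviation come from the text; the t ratio, group sizes, and the use of .93 as a pretest-posttest correlation in the second part are hypothetical values chosen only for illustration, and the function names are ours.

```python
import math

WRAT_SD = 15.0  # standard deviation of WRAT standard scores, from the text

# Raw treatment effects reported in the text, in WRAT points.
raw_effects = {"reading": 5.54, "arithmetic": 5.91}
interpretable = {test: raw / WRAT_SD for test, raw in raw_effects.items()}


def d_final_status(t, n_e, n_c):
    """Formula 17: conversion suited to a t test on final-status scores."""
    return t * math.sqrt(1.0 / n_e + 1.0 / n_c)


def d_gain_scores(t, r_xy, n_e, n_c):
    """Formula 10: conversion suited to a t test on gain scores."""
    return d_final_status(t, n_e, n_c) * math.sqrt(2.0 * (1.0 - r_xy))


for test, d in interpretable.items():
    print(f"{test}: {d:.2f} standard deviations")

# Hypothetical t ratio and group sizes, with an assumed correlation of .93.
t, n_e, n_c, r_xy = 3.0, 6, 6, 0.93
print(f"Formula 17: {d_final_status(t, n_e, n_c):.2f}")
print(f"Formula 10: {d_gain_scores(t, r_xy, n_e, n_c):.2f}")
```

Applying Formula 17 to a t ratio computed on gains overstates the effect size by the factor 1 / sqrt(2(1 - r)), which grows quickly as the pretest-posttest correlation approaches 1.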

Table 1 compares our estimates of effect sizes for Keshock's
data with estimates made by Rosenthal and Rubin. Our calculations
of composite effect sizes are based on a reported correlation of
.7 between arithmetic and reading scores on the WRAT. Our
estimated standard errors are based on retest correlations of .93
for WRAT composite scores.



Gender and susceptibility to influence. Becker's (1983)
effect size synthesis on this topic used Hedges' (1982)
homogeneity approach. More than 100 studies in this area were
originally located by Eagly, who used both a box-score approach
and meta-analytic methodology to integrate the results of the
studies (Eagly, 1978; Eagly & Carli, 1981). Eagly's analyses
showed that males and females differed significantly in
susceptibility to influence in the three areas of persuasion,
conformity, and group pressure. Becker thought, however, that a
reanalysis of Eagly's data was needed. To her, Eagly's box-score
approach seemed clearly inadequate, and even Eagly and Carli's
meta-analytic methods seemed to be ad hoc.

Becker checked individually each of the effect sizes
reported by Eagly and Carli, and she reported that they were quite
accurate for the most part. Only a handful of the nearly 100
effect sizes recalculated by Becker differed by as much as 0.10
from Eagly and Carli's original estimates. Becker used her own
recoded effect sizes rather than Eagly and Carli's estimates in
her reanalysis of results in this area.

Becker's reanalysis led her to question Eagly and Carli's
conclusions. Like Eagly and Carli, Becker found average
differences between males and females in susceptibility to
influence, but she did not attach much weight to this finding.
More important to her was her observation that variation in study
effects was more closely related to study methodology than
to indicators of sex bias. Becker pointed out, for example, that
variation in results in persuasion studies was related to type
of outcome variable used in the study (a methodological feature)
rather than to such factors as gender of the investigator, gender
bias in message, and so on. Studies that used postscores on an
attitude measure as the outcome variable (Group I studies)
produced near-zero effects; studies that used change scores,
covariance-adjusted scores, or their equivalent (Group II studies)
produced more sizeable effects.



It is possible to evaluate Becker's conclusions because she
provided a figure showing effect sizes and standard errors of
these effects for the 36 persuasion studies. Comparing Becker's
statistics with results in the original reports shows that she
calculated operative effect sizes for all studies, no matter what
type of experimental design or dependent variable was used in a
study. Such effect sizes do not estimate the same quantities for
different research designs.

Becker's calculations led her to reach the following
conclusion:

Methodological considerations rather than features
representing sex bias explain the variability in persuasion
study results. The [...] stable sex difference for persuasion
studies that had been noted by Maccoby and Jacklin (1974)
seems to be largely spurious. Given [...] for
the size of the differences with this methodological
artifact, claims of sex bias must hold less sway. (p. 13).

It seems to us that it is not a methodological characteristic of
the studies that explains the different effect sizes in Group I
and Group II experiments. It is rather Becker's method of
calculating effect sizes that explains her finding. Group I
treatment effects were standardized on a final-status measure,
$S_y$; Group II treatment effects were standardized on measures
of gain, $S_g = S_y\sqrt{2(1 - r_{xy})}$, or on residual
measures, $S_{y \cdot x} = S_y\sqrt{1 - r_{xy}^2}$.

Given the difference in the units on which Group I and II
treatment effects are standardized, meaningful conclusions cannot
be drawn from an analysis of the combined data set.
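
The dependence of an effect size on its standardizing unit can be
illustrated with a small numerical sketch. The inputs below (a raw
treatment difference of 2 points, a posttest standard deviation of
10, and a pretest-posttest correlation of .80) are hypothetical and
are not taken from Becker's data, and the function names are
illustrative only. The sketch assumes equal pretest and posttest
standard deviations and divides one raw difference by each of the
three standard deviations named above.

```python
import math


def sd_gain(s_y: float, r_xy: float) -> float:
    """SD of gain scores, assuming equal pretest and posttest SDs."""
    return s_y * math.sqrt(2.0 * (1.0 - r_xy))


def sd_residual(s_y: float, r_xy: float) -> float:
    """SD of posttest scores after linear regression on the pretest."""
    return s_y * math.sqrt(1.0 - r_xy ** 2)


def effect_sizes(raw_diff: float, s_y: float, r_xy: float) -> dict:
    """Standardize the same raw difference on three different units."""
    return {
        "final_status": raw_diff / s_y,
        "gain": raw_diff / sd_gain(s_y, r_xy),
        "residual": raw_diff / sd_residual(s_y, r_xy),
    }


# Hypothetical inputs, chosen only for illustration.
es = effect_sizes(raw_diff=2.0, s_y=10.0, r_xy=0.80)
for unit, value in es.items():
    print(f"{unit:>12}: {value:.3f}")
```

With these hypothetical inputs the single raw difference yields an
effect size of 0.20 on the final-status unit, about 0.32 on the
gain unit, and about 0.33 on the residual unit. A comparison of
Group I and Group II effect sizes would therefore show a
difference even if the underlying treatment effects were
identical.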

Coaching effects. Pearlman's (1984) effect size analysis
was based on 38 studies of coaching effects originally analyzed by
Kulik et al. (1984). Kulik and his colleagues had concluded from
their meta-analysis of results reported in the studies that
coaching programs in general have positive effects on aptitude
test performance. They cautioned, however, that results differed
in two literatures on coaching: the literature on the SAT and the
literature on other aptitude tests. Kulik and his co-workers were
unable to explain much of the variability in effect sizes within
these two literatures, and they concluded that "it was impossible
to explain fully why coaching results differ from study to study
as much as they do" (p. 187).

Pearlman analyzed the data assembled by Kulik and his co-
workers using the methodology for effect size synthesis developed
by Hunter et al. (1982). Pearlman's analysis led to conclusions
that differed from Kulik's on all major points. Pearlman
concluded, for example, that coaching effects are small for both
the SAT and for other tests. Most important, Pearlman concluded
that a substantial portion of the observed variation in effect
sizes was attributable to a statistical artifact, sampling error,
rather than to true differences in results from different studies.
Pearlman interpreted his results as supporting the conclusion that
between-study differences in coaching effectiveness are much less
extensive than they appear to be.

Close examination of Pearlman's calculations shows that they
are seriously flawed. These flaws can be seen most clearly in
Pearlman's analysis of results from 14 SAT studies included in the
total pool of 38 studies. Pearlman calculated the total sampling
error in the 14 studies using a cumulation formula (Hunter et al.,
1982, p. 102) that is closely related to Formula 9 and is
appropriate for calculating sampling error with independent-group,
posttest-only experimental designs. None of the 14 SAT studies
used such a design, however, and none of the 14 effect sizes for
these studies was calculated from such formulas as 1 or 17. The
formulas that Kulik and his co-workers used to estimate effect
sizes were formulas for interpretable effect sizes: Formulas 2,
10, 12, 13, and 15. The appropriate formulas for calculating
standard errors of these effect sizes are Formulas 11, 14, and 16.



The results of Pearlman's failure to take experimental design
into account are serious. He estimated that [...] of the
variance in the distribution of SAT study effects could be
attributed to sampling errors in the individual studies. If he
had taken into account the fact that all 14 SAT studies used pre-
post designs and that SAT retests correlate .88, he would have had
to conclude that sampling error of effect sizes in individual
studies could not account for more than [...] of the
variation in study effects. Since some of the SAT studies used
additional covariates, matched-pairs designs, and factorial
designs that further reduced sampling errors, the true proportion
of study effect variation attributable to sampling error may be
even lower.
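
The size of this reduction can be sketched numerically. The
example assumes the common large-sample approximation for the
sampling variance of a standardized mean difference from an
independent-group, posttest-only design. It further assumes that
estimating the treatment effect from pre-post gains multiplies the
leading term of that variance by 2(1 - r) when the effect is still
expressed in raw-score standard deviation units. These expressions
are stand-ins and are not necessarily identical to Formulas 9 and
11. The effect size and sample sizes are hypothetical; only the
retest correlation of .88 comes from the text.

```python
def var_posttest_only(d: float, n1: int, n2: int) -> float:
    """Approximate sampling variance of d, independent groups, posttest only."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2))


def var_pre_post(d: float, n1: int, n2: int, r: float) -> float:
    """Approximate sampling variance of d when gains are compared.

    Assumption: equal pretest and posttest SDs, so the leading term
    of the posttest-only variance shrinks by the factor 2(1 - r).
    """
    return 2.0 * (1.0 - r) * (n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2))


# d, n1, and n2 are hypothetical; r = .88 is the retest correlation cited above.
d, n1, n2, r = 0.15, 50, 50, 0.88
v_post = var_posttest_only(d, n1, n2)
v_gain = var_pre_post(d, n1, n2, r)

print(f"posttest-only variance: {v_post:.5f}")
print(f"pre-post variance:      {v_gain:.5f}")
print(f"ratio:                  {v_gain / v_post:.3f}")
```

Under these assumptions the pre-post sampling variance is roughly
one quarter of the posttest-only value. A cumulation formula that
presumes posttest-only designs would therefore overstate sampling
error by about a factor of four for a study of this kind.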

Conclusions

Although many have found Glass's meta-analytic
methodology appealing and useful, some methodologists have
criticized the statistics that underlie this methodology. Hedges
and Olkin (1982), for example, have described Glass's procedures
as ad hoc and generally inappropriate. They have also asserted
that hundreds of meta-analyses based on Glass's model have
used statistics that were [...] that are
demonstrably incorrect (Hedges & Olkin, 1985, p. ii). "The
conclusions of these meta-analyses may indeed be correct," Hedges
and Olkin have written, "but the statistical reasoning in support
of these conclusions is not" (1985, p. 14).

Hedges and Olkin (1985), Hunter, Schmidt, and Jackson (1982),
and Rosenthal (1984) have in recent books tried to firm up the
statistical basis of meta-analysis. Careful examination of their
books shows, however, that they fail to distinguish between
interpretable and operative effect sizes, and they do not provide
guidelines on the calculation of interpretable effect sizes for
studies with different experimental designs. Their failure to
consider the influence of experimental design on effect size
calculation seriously limits the utility of their work.

It is no surprise to find that users of the statistics
advocated in these recent books have produced works with serious
flaws:

1. These meta-analyses have often been based on operative rather
than interpretable effect sizes. Because operative effect sizes
are usually calculated with reduced standardization terms,
operative effect sizes are inappropriate for use in meta-analysis.
When operative effect sizes rather than interpretable ones are
used in research syntheses, an artifactual relationship arises
between effect sizes and experimental designs, and average effect
sizes often become seriously inflated.

2. These meta-analyses have often used inaccurate
standard errors. Methodologists who have written about standard
errors of effect sizes have presented only one formula for
calculating such standard errors. This formula gives accurate
results only when applied to studies using an unblocked, posttest-
only design. The formula produces inaccurate results when applied
to studies that estimate effects with greater precision, that is,
to the majority of studies in the social sciences.

We believe that valid conclusions cannot be drawn from meta-
analyses in which effect sizes are miscalculated. Nor can valid
conclusions be drawn when meta-analysts use inaccurate standard
errors to test the homogeneity of collections of effect sizes. We
are therefore more pessimistic in our assessment of recent meta-
analytic work than Hedges and Olkin were in their evaluation of
earlier meta-analytic results. We believe that both the
statistical reasoning and the conclusions are likely to be
incorrect in studies that have used the newer statistical methods
for research synthesis.

Table 1

Comparison of Effect Sizes in Keshock's (1970) Study
Estimated by Different Formulas

                          Effect Size

          Estimated by                  Estimated from
          Rosenthal & Rubin (1982)      Formulas 2 and 11

Grade         M         SD                  M         SD

2           3.85       1.03               0.89       0.22
3           2.34       0.78               0.53       0.22
4           0.47       0.58               0.11       0.22
5           1.48       0.66               0.34       0.22

Mean        2.04       0.73               0.44       0.11